Abstract
Background In recent years, an increasing number of health chatbots have been published in app
stores and described in the research literature. Given the sensitive data they process
and the care settings for which they are developed, evaluation is essential to avoid
harm to users. However, evaluations of these systems are reported inconsistently and
without a standardized set of evaluation metrics. The lack of standards in health
chatbot evaluation prevents comparison between systems, which may hamper acceptability
because their reliability remains unclear.
Objectives The objective of this paper is to take an important step toward developing a health-specific
chatbot evaluation framework by finding consensus on relevant metrics.
Methods We used an adapted Delphi study design to verify and select potential metrics that
we initially retrieved from a scoping review. Over three survey rounds, we invited researchers,
health professionals, and health informaticians to score each metric for inclusion in the
final evaluation framework. We distinguished metrics rated as relevant with
high, moderate, or low consensus. The initial set comprised 26 metrics
(categorized as global metrics, metrics related to response generation, metrics related
to response understanding, and metrics related to aesthetics).
Results Twenty-eight experts joined the first round and 22 (79%) remained through the third round.
Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus.
The core set for our framework comprises mainly global metrics (e.g., ease of use,
security, content accuracy), metrics related to response generation (e.g., appropriateness
of responses), and metrics related to response understanding. Metrics on aesthetics (font
type and size, color) were less well agreed upon; only moderate or low consensus was
achieved for those metrics.
Conclusion The results indicate that experts largely agree on the proposed metrics and that the consensus
set is broad. This implies that health chatbot evaluation must be multifaceted to
ensure acceptability.
Keywords
health chatbots - conversational agents - performance measures - evaluation framework
- Delphi study